A Theory of Speech in Multimodal Systems

Authors

  • Niels Ole Bernsen
  • Laila Dybkjær
Abstract

Increasingly, speech input and/or speech output is being used in combination with other modalities for the representation and exchange of information with, or mediated by, computer systems. As a result, a growing number of system and interface developers face the question of whether to use speech input and/or speech output in multimodal combinations in the applications they are about to build. This paper presents first results on speech in multimodal systems from a test of a theory-based approach to speech functionality. The test used a large corpus of claims about speech functionality derived from the recent literature.

1. SPEECH FUNCTIONALITY

The speech functionality problem is the question of what speech is good or bad for, or under which conditions to use, or not to use, speech for information representation and exchange, either alone or in combination with other modalities. With the rapid spread of speech technologies, the speech functionality problem has become one of real practical importance. The research literature is becoming replete with studies of speech functionality, including speech in multimodal systems, such as speech and multimedia [1], speech and graphics [2,3], speech and gesture [4], speech in auditory interfaces [5,6], speech, pen and graphics [7,8,9,10], and email vs. voice mail [11].

It seems unlikely, however, that empirical studies alone will tell system developers what they need to know in time to avoid user dissatisfaction or poor system performance caused by erroneous choices of modality combinations. The reason is the complexity of the speech functionality problem (Figure 1). The combinatorics described in Figure 1 are daunting. If it were possible at all, it would take decades of empirical experimentation to investigate all the possibilities.
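
To give a rough feel for the combinatorics, the sketch below multiplies out the number of distinct claims that could be formed if each variable named in Figure 1 took only a handful of values. Every count in `SLOT_SIZES` is a hypothetical placeholder chosen for illustration; the paper gives no such figures, and the function names are likewise assumptions.

```python
from math import prod

# Hypothetical number of values per slot of the Figure 1 claim template.
# These counts are illustrative assumptions, NOT figures from the paper.
SLOT_SIZES = {
    "speech_modality": 3,        # e.g. keywords, unrestricted discourse, ...
    "direction": 3,              # input, output, combined input/output
    "non_speech_modality": 20,
    "generic_task": 10,
    "speech_act_type": 5,
    "user_group": 5,
    "work_environment": 5,
    "generic_system": 5,
    "performance_parameter": 4,
    "learning_parameter": 3,
    "cognitive_property": 4,
}


def count_claim_instances(slot_sizes):
    """Number of distinct claims if exactly one value is chosen per slot."""
    return prod(slot_sizes.values())


def years_to_test(n_claims, studies_per_year):
    """Years needed to test every claim at a given (hypothetical) study rate."""
    return n_claims / studies_per_year


if __name__ == "__main__":
    n = count_claim_instances(SLOT_SIZES)
    print(f"{n:,} candidate claims")
    print(f"{years_to_test(n, studies_per_year=1000):,.0f} years at 1000 studies per year")
```

Even with these modest placeholder counts, and ignoring the "and/or" combinations the template allows within each slot, the product runs into the tens of millions, which is the intuition behind preferring theoretical guidance to exhaustive experimentation.
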
There are several speech modalities, such as keywords and unrestricted discourse; there is speech as input and speech as output; there are scores of non-speech modalities with which speech might conceivably be combined; and the success of a particular modality choice is subject to an unlimited number of instantiated domain variables, including task type (e.g. navigating hypermedia), communicative act (e.g. alarm), user group (e.g. the blind), work environment (e.g. natural field settings), system type (e.g. personal intelligent assistant), performance parameters (e.g. more efficient), learning parameters (e.g. learning overhead), and cognitive properties (e.g. attention load). It would therefore be useful for developers to be able to rely largely on comprehensible theoretical guidance instead of lengthy experimentation. This paper reports the results of a recent study of how developers' reasoning about speech functionality might be supported, with emphasis on the use of speech in a multimodal context.

[combined speech input/output, speech output, or speech input modalities M1, M2 and/or M3 etc.] or [speech modality M1, M2 and/or M3 etc. in combination with non-speech modalities NSM1, NSM2 and/or NSM3 etc.] are [useful or not useful] for [generic task GT and/or speech act type SA and/or user group UG and/or interaction mode IM and/or work environment WE and/or generic system GS and/or performance parameter PP and/or learning parameter LP and/or cognitive property CP] and/or [preferable or non-preferable] to [alternative modalities AM1, AM2 and/or AM3 etc.] and/or [useful on conditions] C1, C2 and/or C3 etc.

Figure 1. The complexity of the problem of accounting for the functionality of speech in systems and interface design. Domain variables are in boldface.

2. AN ENCOURAGING RESULT

Given the huge complexity described in Section 1, it is striking that the only constant property of claims about speech functionality, such as “Speech input is useful when the user’s hands are occupied”, is that they refer, often obliquely, to objective modality properties, for example that speech is omnidirectional or eyes-free. The purpose of Modality Theory [12,13] is to describe the objective properties of all unimodal modalities in acoustics, graphics and haptics. The observation that all speech functionality claims refer to modality properties gave rise to the idea of testing the explanatory power of Modality Theory on a small but well-defined fragment within the scope of the theory, namely a set of claims about speech functionality.

Using as data points 120 claims about speech functionality that were systematically gathered from papers dedicated to the issue [14], it was shown that a mere 18 modality properties (Figure 2) were sufficient to justify, support or correct 106 (97%) of the 109 claims that were not flawed in one way or another [15]. The 18 modality properties were taken from Modality Theory and include all the properties that the theory could contribute to the claims analysis. All claims could be categorised as belonging to one of 13 types (Figure 3). Eleven of the 13 types were represented in the data.

ISCA Archive

Publication date: 1999